1. Introduction
1.1 Data loading
1.2 Data scaling
2. Broad analysis
2.1 Boxplots
2.2 Densityplot
2.3 K-means clustering
2.4 PCA
3. Specific analysis: Lapatinib
3.1 Analysis of biomarker and t-test between treated and untreated cells
3.1.1 Expression of biomakers before and after the treatment of Lapatinib
3.1.2 Searching for our own biomarkers
3.1.3 Boxplot of our biomarkers
3.1.4 ggBoxplot of our biomarkers’ basalexpression in breast cells with Kruskal-Wallis test
3.1.5 Heatmap of biomarkers in breast cells
3.2. Paired t-test between the Lapatinib treated und untreated data
3.2.1 Genome wide paired t-test
3.2.2 Paired t-test of our biomarker before and after treatment
3.2.3 Paired t-test over genes, increased confidence level (99 %)
3.2.4 Visualization of the genome-wide paired t-test in a vulcano plot
3.3 K-means
3.4 PCA
3.5 Spearman correlation
4. Main questions
4.1 Question 1: Does the cell doubling time correlate with reduced drug sensitivity?
4.1.1 Linear regression ~ Inoculation_Density
4.1.2 Linear regression ~ Doubling-time
4.1.3 Multiple regression
4.1.4 Table of conclusion
4.2 Question 2: Does Lapatinib have a similar effect on lung cancer as Erlotinib?
4.2.1 Heatmap
4.2.2 Plot Erlotinib and Lapatinib, colored by tissue
4.2.3 Pearson correlation
4.2.4 Lung cellines
4.2.5 Anova
4.3 Question 3: Does Lapatinib have a similar effect on brain metastases as it does on breast cancer?
4.3.1 Volcano plot
4.3.2 Heatmap for visualization
4.3.3 Investigation of correlation
Our project is about Lapatinib and divided into three parts: Broad analysis, specific analysis and our three main questions. The main questions are the most important part of our project. They address whether cell doubling time correlates with reduced drug sensitivity and the effect of lapatinib on lung cancer and brain metastases. Twenty-five per cent of breast cancers overexpress ErbB2 / HER which leads to a more aggressive phenotype. Lapatinib blocks tyrosine kinases that belong to the HER2 receptor, which is increasingly found on the cell surface of cancer cells. HER2 is a receptor for the epidermal growth factor (EGF), which stimulates the cell division of cancer cells. By blocking these HER2 receptors, the unregulated growth of cancer cells can be controlled again. Lapatinib binds the ATP-binding site of the receptor’s intracellular domain and inhibits the survival and proliferation pathways of the tumor cell.
After checking for normalization, we scaled our data in the first place to provide the scaled data for further analysis.
list = list(Treated,Untreated)
nlist = lapply(list,scale)
Treated = as.data.frame(nlist[[1]])
Untreated = as.data.frame(nlist[[2]])
Fold_Change = Treated - Untreated
Fold_Change = data.frame(Fold_Change)
rm(NCI_TPW_gep_treated,NCI_TPW_gep_untreated,list,nlist)
n= as.data.frame(t(NegLogGI50))
rmv.rows = apply(n, 1, function(x) {
sum(is.na(x))
})
NLGI50.all = n[-which(rmv.rows > 0), ] # Removing any row with 1 or more missing values
rm(rmv.rows, n)
# used for second main question
As the very first step in our data analysis, we checked the normalization of the data.
In order to check whether there is a connection between the various drugs, we colored the plot by drug.
As shown, there is a connection between the different “boxes” and the different drugs. This could be the reason for different batches so we scale the data.
As you can see the scaling is necessary to compensate the batch effect. Scaling is then also applied on Untreated before it is further analyzed.
As an other method to get in touch with the data, we checked if there is a difference between the treated and untreated genexpression. The abline shows the 3 quantiles ( 25 % 50 % 75 % )
As assumed, the expression patterns don’t differentiate much.
To look for clusters in the raw data we performed a k-means clustering and searched for potentially clusters using the most variable (>75 % Var), and thus the most informative genes.
Running a loop for the best cluster-amount (searching for an “elbow”)
As we wanted an “elbow”, it is hard to determine the amount of clusters.
Since the clustering method by performing k-means clustering was not successful, we performed a principal component analysis for dimensionality reduction as another method for identifying clusters or patterns while preserving as much information of the original data set as possible.
pca <- prcomp(t(Fold_Change), scale = TRUE)
By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.
pca.var <- pca$sdev^2 # sdev calculates variation each PC accounts for
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
# since percentages make more sense than normal variation values calculate % of variation
plot(pca.var.per[1:10], main = "Scree plot", type = "b", xlab = "Principal Components", ylab = "% variation")
plot(cumsum(pca.var.per[1:15]), main = "cumulative variation", type = "l", xlab = "Principal Components", ylab = "% variation")
As seen in the scree plot, the first three or four PCs have capture most of the information since the curve bends at an “elbow” at PC4 - this is our cutting off point.
# creating data frame with all PCs
# cleaning up sample names as they differ between matrices
pca.data <- data.frame(pca$x)
rownames(pca.data) <- gsub(x = rownames(pca.data), pattern = "X786", replacement = "786")
pca.data <- cbind(sample =rownames(pca.data), pca.data)
In order to color our clusters according to sample annotations, we created a color vector by dissecting the metadata matrix by our PCA data samples. Additionally we were fixing some labeling problems.
### Metadata color matrix for coloring
Metadata$sample <- gsub(x = Metadata$sample, pattern = "-", replacement = ".")
metad.cl <- subset(Metadata, Metadata$sample %in% pca.data$sample)
## adjust row length of metadata to pca.data
metad.cl$mechanism <- Drug_Annotation$Mechanism[match(metad.cl$drug, Drug_Annotation$Drug)]
metad.cl$msi <- Cellline_Annotation$Microsatellite_instability_status[match(metad.cl$cell, Cellline_Annotation$Cell_Line_Name)]
Subsequently we created the color vectors for coloring according to drug and tissue type.
# plotting all informative PCs
#color vectors for coloring by drug and tissue
#drug
dcol <- rainbow(15)
color_drug = dcol[metad.cl$drug]
drug <- levels(metad.cl$drug)
#tissue
tcol <- c("springgreen", "olivedrab4", "yellow2", "orange", "red2", "orchid1", "purple", "turquoise2", "navy")
color_tissue = tcol[metad.cl$tissue]
tissue <- levels(metad.cl$tissue)
To determine clusters of samples based on their similarity, we plotted PC1 and PC2 as well as a PC2 and PC3 while coloring by tissue type and drug respectively.
par(mfrow=c(1,2), cex=0.7)
## colored by tissue
#plot PC1 and PC2
par(mfrow=c(1,2), cex=0.7)
plot(pca$x[,1],
pca$x[,2],
col = color_tissue,
pch = 19,
xlab = paste("PC1 (",pca.var.per[1],"%)"),
ylab = paste("PC2 (",pca.var.per[2],"%)"))
#create legend
legend("topleft",
legend = tissue,
col = tcol,
pch = 19,
xpd = "TRUE",
bty = "n",
cex = 0.75
)
#create title
mtext("PCA of Fold Change colored by tissue",
side = 3,
line = -2,
cex = 1.2,
font = 2,
outer = TRUE)
#plot PC2 and PC3
plot(pca$x[,2],
pca$x[,3],
col = color_tissue,
pch = 19,
xlab = paste("PC2 (",pca.var.per[2],"%)"),
ylab = paste("PC3 (",pca.var.per[3],"%)"))
#create legend
legend("right",
legend = tissue,
col = tcol,
pch = 19,
xpd = "TRUE",
bty = "n",
cex = 0.75,
inset = c(0, 2)
)
#create title
mtext("PCA of Fold Change colored by tissue",
side = 3,
line = -2,
cex = 1.2,
font = 2,
outer = TRUE)
par(mfrow=c(1,2), cex=0.7)
## colored by drug
#plot PC1 and PC2
plot(pca$x[,1],
pca$x[,2],
col = color_drug,
pch = 19,
xlab = paste("PC1 (",pca.var.per[1],"%)"),
ylab = paste("PC2 (",pca.var.per[2],"%)"))
#create legend
legend("topleft",
legend = drug,
col = dcol,
pch = 19,
xpd = "TRUE",
bty = "n",
cex = 0.75
)
#create title
mtext("PCA of Fold Change colored by drug",
side = 3,
line = -2,
cex = 1.2,
font = 2,
outer = TRUE)
#plot PC2 and PC3
plot(pca$x[,2],
pca$x[,3],
col = color_drug,
pch = 19,
xlab = paste("PC2 (",pca.var.per[2],"%)"),
ylab = paste("PC3 (",pca.var.per[3],"%)"))
#create legend
legend("right",
legend = drug,
col = dcol,
pch = 19,
xpd = "TRUE",
bty = "n",
cex = 0.75,
inset = c(0, 2)
)
#create title
mtext("PCA of Fold Change colored by drug",
side = 3,
line = -2,
cex = 1.2,
font = 2,
outer = TRUE)
As we can see, in contrast to the k-means method, clusters can be determined. Respecting the color annotation, it is clear to say that the samples cluster according to drug treatment.
Treated_t<-data.frame(t(Treated))
Untreated_t<-data.frame(t(Untreated))
LapatinibUntreated<-dplyr::select(Untreated, contains('Lapa'))
LapatinibTreated<-dplyr::select(Treated, contains('Lapa'))
before=as.matrix(LapatinibUntreated)
after=as.matrix(LapatinibTreated)
LapUn<-data.frame(t(before))
LapTreat<-data.frame(t(after))
Unfortunately, we can not use them as biomarkers so we search for our own biomarkers.
Treat<-colMeans(LapTreat, na.rm = TRUE)
Untreat<-colMeans(LapUn, na.rm = TRUE)
ut<-(Treat-Untreat)
ut_sort<-sort(ut, decreasing = TRUE)
head(ut_sort)
## GDF15 NUPR1 DDIT3 TRIB3 ATF3 ASNS
## 0.7150148 0.6843715 0.5794749 0.5753359 0.5603209 0.5305063
GDF3 Growth Differentiation Factor 3 belongs to the transforming growth factor beta (TGF-beta) superfamily and has some intrinsic activity.
NUPR1 stands for nuclear protein, transcriptional regulator, 1. It regulates the transcription during stress signals by empowering cells with resistance to stress.
DDIT3 encodes for the DNA damage-inducible transcript 3 protein, a member of the CCAAT/enhancer-binding protein which is a transcription factor. It is important for the response of cell stresses. The DNA damage-inducible transcript 3 protein induces cell cycle arrest and apoptosis in response to ER stress.
TRIB3 stands for Tribbles Pseudokinase 3. The kinase is induced by the transcription factor NF-kappaB. It is essential for cell proliferation.
ATF3 stands for activating transcription factor 3. It belongs to the mammalian activation transcription factor/cAMP responsive element-binding (CREB). ATF-3 is induced by physiological stress.
ASNS stands for Asparagine Synthetase: The enzyme catalyzes the production of the amino acid L-asparagine.
boxplot(LapTreat$GDF15,LapUn$GDF15,
LapTreat$NUPR1, LapUn$NUPR1,
LapTreat$DDIT3, LapUn$DDIT3,
LapTreat$TRIB3, LapUn$TRIB3,
LapTreat$ATF3,LapUn$ATF3,
LapTreat$ASNS, LapUn$ASNS,
col=c("deeppink4", "deeppink4", "burlywood1", "burlywood1", "darkcyan", "darkcyan", "pink", "pink", "darkolivegreen3", "darkolivegreen3", "lightblue", "lightblue"),
names = c("GDF15 before", "GDF15 after", "NUPR1 before", "NUPR1 after ", "DDIT3 before", "DDIT3 after" ,"TRIB3 before","TRIB3 after", "ATF3 before", "ATF3 after", "ASNS before", "ASNS after"),
main="Expression of Biomakers before and after treatment with Lapatinib",
xlab="Biomarkers",
ylab="Expression")
breastcells<-subset(Metadata, tissue == "Breast")
breast<-subset(CCLE_basalexpression, breastcells$cell %in% colnames(CCLE_basalexpression))
tbreastcells<-data.frame(t(breast))
breast_marker <- c(tbreastcells$GDF15,tbreastcells$NUPR1, tbreastcells$DDIT3, tbreastcells$TRIB3, tbreastcells$ATF3, tbreastcells$ASNS)
breast_marker_matrix <- data.frame(tbreastcells$GDF15,tbreastcells$NUPR1, tbreastcells$DDIT3, tbreastcells$TRIB3, tbreastcells$ATF3, tbreastcells$ASNS)
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.GDF15"] <- "GDF15"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.NUPR1"] <- "NUPR1"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.DDIT3"] <- "DDIT3"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.TRIB3"] <- "TRIB3"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.ATF3"] <- "ATF3"
names(breast_marker_matrix)[names(breast_marker_matrix) == "tbreastcells.ASNS"] <- "ASNS"
Biomarkers<-c(rep('DF15',45),rep('NUPR1',45),rep('DDIT3',45),rep('TRIB3',45),rep('ATF3',45),rep('ASNS',45))
Expression<-c(tbreastcells$GDF15,tbreastcells$NUPR1, tbreastcells$DDIT3, tbreastcells$TRIB3, tbreastcells$ATF3, tbreastcells$ASNS)
df_breast<-data.frame(Expression, Biomarkers)
ggboxplot(df_breast, x="Biomarkers", y="Expression", color = "Biomarkers",
add = "jitter", legend = "none")+
rotate_x_text(angle = 45)+
stat_compare_means(method = "kruskal.test", label.y = 15)+ # Add global kurskal wallis test p-value
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.", hide.ns = TRUE) # Pairwise comparison against all
pheat_breast<-pheatmap(breast_marker_matrix ,
show_rownames = FALSE,
show_colnames =TRUE,
cutree_cols = 4,
cutree_rows = 3,
color = viridis(4),
drop_levels = TRUE,
border_color = NA,
clustering_method = "ward.D2",
scale="row")
pheat_breast
t.test(before, after, paired=TRUE)
##
## Paired t-test
##
## data: before and after
## t = -1.5497e-13, df = 718140, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0003680021 0.0003680021
## sample estimates:
## mean of the differences
## -2.909656e-17
before_biomarker<-LapUn %>% dplyr::select("GDF15", "NUPR1", "DDIT3", "TRIB3", "ATF3", "ASNS")
after_biomarker<-LapTreat %>% dplyr::select("GDF15", "NUPR1", "DDIT3", "TRIB3", "ATF3", "ASNS")
table<-data.table(col_t_paired(before_biomarker, after_biomarker, alternative = "two.sided", mu = 0,conf.level = 0.95))
Marker=c("GDF15", "NUPR1", "DDIT3", "TRIB3", "ATF3", "ASNS")
pvalue<-table$pvalue
table[, .(Marker,pvalue)]
## Marker pvalue
## 1: GDF15 2.612447e-12
## 2: NUPR1 1.327187e-11
## 3: DDIT3 9.979311e-20
## 4: TRIB3 6.995573e-20
## 5: ATF3 2.142510e-12
## 6: ASNS 2.207251e-14
row_t_test<-as.data.frame(row_t_paired(before, after, alternative = "two.sided", mu = 0,conf.level = 0.99))
summary(row_t_test$pvalue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0002317 0.0756336 0.2421297 0.4428285 0.9998899
Lap_t_test<-data.table(col_t_paired(LapUn, LapTreat, alternative = "two.sided", mu = 0,conf.level = 0.95))
padj<-p.adjust(Lap_t_test$pvalue, "BH")
Lapatinib_fc<-as.data.frame.numeric(colMeans(LapUn-LapTreat))
Lgenes<-(colnames(LapUn))
vulcano<-data.frame(genes=Lgenes,Fold=Lapatinib_fc, pvalue=padj)
names(vulcano)[names(vulcano) == "colMeans.LapUn...LapTreat."] <- "log2FoldChange"
with(vulcano, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-2.5,2)))
# Add colored points: red if padj<0.05, blue if log2FC>1, green if both)
with(subset(vulcano, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
with(subset(vulcano, abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="blue"))
with(subset(vulcano, padj<.05 & abs(log2FoldChange)>1), points(log2FoldChange, -log10(pvalue), pch=20, col="green"))
# Label points with the textxy function from the calibrate plot
with(subset(vulcano, padj<.05 & abs(log2FoldChange)>1), textxy(log2FoldChange, -log10(pvalue), labs = genes, cex=.8))
LapatinibFold = select(Fold_Change, contains("Lapa"))
#Determining the number of clusters
topVarFold = apply(LapatinibFold, 1, var)
summary(topVarFold)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007514 0.0090894 0.0138573 0.0197222 0.0226351 0.4897715
# Using the most variable, thus informative genes
topVarFold75 = LapatinibFold[topVarFold > quantile(topVarFold, probs = 0.75), ]
dim(topVarFold75)
## [1] 3325 54
km = kmeans(x = t(topVarFold75), centers = 2, nstart = 10)
km$tot.withinss
## [1] 6627.055
km = kmeans(x = t(topVarFold75), centers = 3, nstart = 10)
km$tot.withinss
## [1] 6159.495
#running a loop for the best n (searching for "elbow")
wss = (sapply(2:7, function(k) {
kmeans(x =t(topVarFold75), centers = k)$tot.withinss}) )
plot(2:7, wss, type = "b", pch = 19, xlab = "Number of clusters K", ylab = "Total within-clusters sum of squares", main = "Determining the amount of clusters from Foldchange")
# Using the silhouett method
D = dist(t(topVarFold75))
km = kmeans(x = t(topVarFold75), centers = 3, nstart = 10)
s = silhouette(km$cluster, D)
plot(s)
Since the clustering method by performing k-means clustering again was not successful, we performed a principal component analysis over the selected Lapatinib samples for identifying clusters or patterns while preserving as much information of the original data set as possible.
L_fc <- select(Fold_Change, contains("Lapa"))
# PCA
pca <- prcomp(t(L_fc), scale = TRUE)
By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.
pca.var <- pca$sdev^2 # sdev calculates variation each PC accounts for
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
# since percentages make more sense then normal variation values
# calculate % or variation, which is much more interesing
plot(pca.var.per[1:10], main = "Scree plot", type = "b", xlab = "Principal Components", ylab = "% variation")
plot(cumsum(pca.var.per[1:15]), main = "cumulative variation", type = "l", xlab = "Principal Components", ylab = "% variation")
pca.data <- data.frame(pca$x)
rownames(pca.data) <- gsub(x = rownames(pca.data), pattern = "X786", replacement = "786")
pca.data <- cbind(sample =rownames(pca.data), pca.data)
As seen in the scree plot, again the first three or four PCs have capture most of the information since the curve bends at an “elbow” at PC4 - this is our cutting off point.
In order to color our clusters according to sample annotations, we created a color vector by dissecting the metadata matrix as well as the cell-line-annotation matrix by our PCA data samples.
### Metadata matrix for coloring
metad.cl <- subset(Metadata, sample %in% intersect(Metadata$sample, pca.data$sample))
metad.cl$msi <- Cellline_Annotation$Microsatellite_instability_status[match(metad.cl$cell, Cellline_Annotation$Cell_Line_Name)]
metad.cl$inoculation_d <- Cellline_Annotation$Inoculation_Density[match(metad.cl$cell, Cellline_Annotation$Cell_Line_Name)]
metad.cl$doubling_time <- Cellline_Annotation$Doubling_Time[match(metad.cl$cell, Cellline_Annotation$Cell_Line_Name)]
metad.cl$cancer_type <- Cellline_Annotation$Cancer_type[match(metad.cl$cell, Cellline_Annotation$Cell_Line_Name)]
Subsequently we created the color vectors for coloring according to drug and microsatellite instability status.
#color vectors for coloring by msi and tissue
colormsi <- brewer.pal(3, "Set1")
color_msi = colormsi[metad.cl$msi]
msi <- levels(metad.cl$msi)
tscol <- c("springgreen", "olivedrab4", "yellow2", "orange", "red2", "orchid1", "purple", "turquoise2", "navy")
color_tissue = tscol[metad.cl$tissue]
tissue <- levels(metad.cl$tissue)
To determine clusters of samples based on their similarity, we plotted PC1 and PC2 as well as a PC2 and PC3 while coloring by microsatellite instability status and drug respectively.
par(mfrow=c(2,2), cex=0.7, oma=c(2,3,2,2))
par(mar=c(4, 6, 3, 3))
## colored by msi
#plot PC1 and PC2
msi12 <- plot(pca$x[,1],
pca$x[,2],
col = color_msi,
pch = 19,
xlab = paste("PC1 (",pca.var.per[1],"%)"),
ylab = paste("PC2 (",pca.var.per[2],"%)"))
#create legend
legend(450, 0,
legend = msi,
col = colormsi,
pch = 19,
xpd = "FALSE",
bty = "n",
cex = 0.75
)
#create title
mtext("PCA of Fold Change colored by MSI",
side = 3,
line = -2,
cex = 1.2,
font = 2,
outer = TRUE)
#plot PC2 and PC3
msi23 <- plot(pca$x[,2],
pca$x[,3],
col = color_msi,
pch = 19,
xlab = paste("PC2 (",pca.var.per[2],"%)"),
ylab = paste("PC3 (",pca.var.per[3],"%)"))
##colored by tissue
#plot PC1 and PC2
tissue12 <- plot(pca$x[,1],
pca$x[,2],
col = color_tissue,
pch = 19,
xlab = paste("PC1 (",pca.var.per[1],"%)"),
ylab = paste("PC2 (",pca.var.per[2],"%)"))
#create legend
legend(450, 0,
legend = tissue,
col = tscol,
pch = 19,
xpd = "TRUE",
bty = "n",
cex = 0.75
)
#create title
mtext("PCA of Fold Change colored by tissue",
side = 3,
line = -16,
cex = 1.2,
font = 2,
outer = TRUE)
#plot PC2 and PC3
tissue23 <- plot(pca$x[,2],
pca$x[,3],
col = color_tissue,
pch = 19,
xlab = paste("PC2 (",pca.var.per[2],"%)"),
ylab = paste("PC3 (",pca.var.per[3],"%)"))
As we can see, rarely any clusters can be determined. In Addition, the color annotation does not reveal any pattern of clustering.
Modeling of a linear relationship between cell doubling time and inoculation density.
# Spearman correlation
cor(Cellline_Annotation$Doubling_Time, Cellline_Annotation$Inoculation_Density, method = "spearman")
## [1] 0.5674381
plot(Cellline_Annotation$Doubling_Time, Cellline_Annotation$Inoculation_Density, pch= 16, col= "blue", main = "Spearman correlation between Doubling Time and Inoculation Density", xlab = "Doubling Time", ylab = "Inoculation Density")
lm(Cellline_Annotation$Inoculation_Density~ Cellline_Annotation$Doubling_Time)
##
## Call:
## lm(formula = Cellline_Annotation$Inoculation_Density ~ Cellline_Annotation$Doubling_Time)
##
## Coefficients:
## (Intercept) Cellline_Annotation$Doubling_Time
## 9110.9 169.4
abline(lm(Cellline_Annotation$Inoculation_Density ~ Cellline_Annotation$Doubling_Time), col = "red", lwd = 2)
As you can see from the plot, there is a positive correlation between inoculation density and doubling time. This means that the higher the inoculation density, the higher the doubling time, which suggests that the tumour grows slowly because the cells need more time to divide.
As mentioned in our presentation, we want to create a model to predict GI50-values and thus to predict, if Lapatinib is a good choice. Therefore, we took several datasets and performed three regression analyses.
The first linear model tries to predict the GI-50 value among the data of the Inoculation density.
The second linear model tries to predict the GI-50 value among the data of the Doubling-time.
Finally, we did a multiple regression with both datasets to predict GI50-values.
The following table shows important values.
| Inoculation_Density | Doublingtime | Multiple regression | |
|---|---|---|---|
| R-squared-value | 0.0013505 | 0.0913649 | 0.0044753 |
| F-statistic | 0.0622076 | 4.6253841 | 0.0921552 |
| RMSE of test-dataset | 0.5192093 | 0.6574516 | 0.7987091 |
The data is hard to discuss, because we used random values for our test and training dataset. Hence, different values on every run are obtained. We could use “selected values” (First 20) but this would be kind of cheating.
As we read in the paper “Antitumor and antiangiogenic effect of the dual EGFR and HER-2 tyrosine kinase inhibitor lapatinib in a lung cancer model. (Diaz et al., 2010)”, Lapatinib and Erlotinib are said to have similar effects. To verify this, we first looked at the correlation of the drugs and then created a graph showing the difference between GI50 for a certain cell line and the middle GI50 for both drugs.
cor = cor(NLGI50.all)
pheatmap(cor, col = cm.colors(256), border_color = NA, main = "Correlation GI50")
Erlotinib and Lapatinib have a strong correlation.
#differece
diff = data.frame(erlotinib = NLGI50.all$erlotinib - mean(NLGI50.all$erlotinib), lapatinib = NLGI50.all$lapatinib- mean(NLGI50.all$lapatinib))
diff$celllines = rownames(NLGI50.all)
#create vector to insert column tissue from Metadata
tissue = sapply(1:nrow(diff), function(x) {
position = which(as.character(Metadata$cell) == diff[x, "celllines"])[1] #if tissue occurs several times, take the first
out = as.character(Metadata[position, "tissue"]) #output the tissue at this position
return(out)
})
diff$tissue = tissue
rm(tissue)
diff$celllines = factor(diff$celllines, levels = diff$celllines[order(diff$tissue)]) #Classified by tissue
e = ggplot(diff, aes(x = celllines, y = erlotinib, fill = tissue))+
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Mean graph plot of GI50 values for Erlotinib")+
theme(plot.title = element_text(hjust = 0.5, size = 19),
axis.title = element_text(size=17))
l = ggplot(diff, aes(x = celllines, y = lapatinib, fill = tissue)) +
geom_bar(stat="identity") +
coord_flip() +
labs(title="Mean graph plot of GI50 values for Lapatinib")+
theme(plot.title = element_text(hjust = 0.5, size = 19),
axis.title = element_text(size=17)) #change size and center title
ggarrange(e, l,
labels = c("A", "B"),
ncol = 2, nrow = 1)
rm(e,l)
The difference from the GI50 for a particular cell line and the mean GI50 is plotted here for Erlotinib and Lapatinib.
cor(NLGI50.all$erlotinib, NLGI50.all$lapatinib, method = "pearson")
## [1] 0.6528188
A Pearson correlation coefficient of ~ 0.65 confirms that these patterns are similar. One reason for this is that both Erlotinib and Lapatinib are EGFR inhibitors. Cell lines that were more sensitive are displayed as bars that project to the right of the mean. Cell lines that were less sensitive are displayed with bars projected to the left.
To answer our more specific question, whether Lapatinib also has an effect on lung cancer, we will only look at the cell lines in lung cancer.
#only lung with mean all
### load data
Metadata_Lapatinib_treated = Metadata[which(Metadata$drug == "lapatinib" & Metadata$dose != "0nM"),]
NegLogGI50 = as.data.frame(readRDS(paste0(wd, "/Data/NegLogGI50.rds")))
#lung cells from Metadata
Lung_Metadata_L_treated = Metadata[which(Metadata$drug == "lapatinib" & Metadata$dose != "0nM" & Metadata$tissue == "Lung"),]
celllines = Lung_Metadata_L_treated$cell
NegLogGI50.lung = as.data.frame(t(NegLogGI50[c("erlotinib", "lapatinib"), celllines]))
#Difference
dif.NegLogGI50.lung = data.frame(erlotinib = NegLogGI50.lung$erlotinib - mean(NLGI50.all$erlotinib), lapatinib = NegLogGI50.lung$lapatinib - mean(NLGI50.all$lapatinib)) #erlotinib data - mean value, lapatinib data - mean value
dif.NegLogGI50.lung$celllines = rownames(NegLogGI50.lung)
# PLot Erlotinib and Lapatinib
e = ggplot(dif.NegLogGI50.lung,aes(x = celllines, y = erlotinib)) + geom_bar(stat = "identity", fill = "skyblue") + geom_text(aes(label=round(erlotinib, 2)), vjust = -0.5, color = "black", size = 3) + coord_flip() + labs(title = "Mean graph plot of GI50 values for Erlotinib, only lung genes")+
theme(plot.title = element_text(size = 15),
axis.title = element_text(size=15))
l = ggplot(dif.NegLogGI50.lung,aes(x = celllines, y = lapatinib)) + geom_bar(stat = "identity", fill = "skyblue") + geom_text(aes(label=round(lapatinib, 2)), vjust = -0.5, color = "black", size = 3) + coord_flip() + labs(title = "Mean graph plot of GI50 values for Lapatinib, only lung genes")+
theme(plot.title = element_text(size = 15),
axis.title = element_text(size=15))
ggarrange(e, l,
labels = c("A", "B"),
ncol = 2, nrow = 1)
rm(e,l)
The difference from the GI50 for cell line from lung tissue and the mean GI50 is plotted here for Erlotinib and Lapatinib.
#correlation lung
cor(NegLogGI50.lung$erlotinib, NegLogGI50.lung$lapatinib, method = "pearson")
## [1] 0.9609488
A pearson correlation coefficent of ~ 0.96 suggests that Lapatinib has a similar effect on lung cancer as Erlotinib.
lapa<-data.frame(Metadata[which(Metadata[,'drug'] == "lapatinib"), ])
erlo<-data.frame(Metadata[which(Metadata[,'drug'] == "erlotinib"), ])
el<-right_join(lapa,erlo, by="cell")
rmv.rows = apply(el, 1, function(x) {
sum(is.na(x))
}) # Go through each row and sum up all missing values
row.names(rmv.rows)
fc<-data.frame(Treated-Untreated)
all<-data.frame(fc[grep("lapatinib|erlotinib", colnames(fc))])
all.rmv<-data.frame(all %>% dplyr::select(
-"CCRF.CEM_erlotinib_10000nM_24h",
-"HL.60_erlotinib_10000nM_24h",
-"HT29_erlotinib_10000nM_24h",
-"K.562_erlotinib_10000nM_24h",
-"LOX_erlotinib_10000nM_24h",
-"SR_erlotinib_10000nM_24h",
-"COLO.205_lapatinib_10000nM_24h"))
la<-data.frame(all.rmv[grep("lapatinib", colnames(all.rmv))])
er<-data.frame(all.rmv[grep("erlotinib", colnames(all.rmv))])
erla<-data.frame(er,la)
Drug<-c(rep('Erlotinib',53), rep('Lapatinib',53))
log2FoldChange<-apply(erla, MARGIN = 2, FUN = mean)
df_drug<-data.frame(log2FoldChange, Drug)
qqnorm(log2FoldChange, pch = 1, frame = FALSE)
qqline(log2FoldChange, col = "steelblue", lwd = 2)
fligner.test(log2FoldChange ~ Drug, data=df_drug )
##
## Fligner-Killeen test of homogeneity of variances
##
## data: log2FoldChange by Drug
## Fligner-Killeen:med chi-squared = 0.75387, df = 1, p-value =
## 0.3853
ggboxplot (data = df_drug, x="Drug", y="log2FoldChange", color = "Drug",
add = "jitter", legend = "none")+
rotate_x_text(angle = 45)+
stat_compare_means(method = "anova", label.x = 2)+ # Add global anova p-value
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.", hide.ns = TRUE) # Pairwise comparison against all
As the paper “Lapatinib Distribution in HER2 Overexpressing Experimental Brain Metastases of Breast Cancer” (Taskar et.al., 2012) suggests, Lapatinb crosses the brain-blood barrier. Since breast cancer tends to spread and form metastases in brain cells and Lapatinib is used in anti-breast cancer therapy, we wanted to analyse our data for similar effects after Lapatinib treatment in breast and cns tissue cells.
At first, we selected Lapatinib response genes by creating matrices with cns- and breast-cell line fold change values.
L_fc <- select(Fold_Change, contains("Lapa"))
L_fc <- as.data.frame(t(L_fc))
rownames(Metadata) <- Metadata$sample
L_treated <- select(Treated, contains("Lapa"))
L_treated <- t(L_treated)
L_untreated <- select(Untreated, contains("Lapa"))
L_untreated <- t(L_untreated)
# selecting breast Lapatinib samples
breast <- Metadata[Metadata[,'tissue']=="Breast",]
rownames(breast) <- breast$sample
rownames(breast) <- gsub(x = rownames(breast), pattern = "-", replacement = ".")
breastFC <- subset(L_fc, rownames(L_fc) %in% rownames(breast))
breastTreated <- subset(L_treated, rownames(L_treated) %in% rownames(breast))
breastUntreated <- subset(L_untreated, rownames(L_untreated) %in% rownames(breast))#
# selecting CNS Lapatinib samples
cns <- Metadata[Metadata[,'tissue']=="CNS",]
rownames(cns) <- cns$sample
rownames(cns) <- gsub(x = rownames(cns), pattern = "-", replacement = ".")
cnsFC <- subset(L_fc, rownames(L_fc) %in% rownames(cns))
cnsTreated <- subset(L_treated, rownames(L_treated) %in% rownames(cns))
cnsUntreated <- subset(L_untreated, rownames(L_untreated) %in% rownames(cns))
Next, we performed a paired t-test of treated and untreated cns and breast samples, respectively, to determine statistical significant fold change values. Since we are performing multiple testing over different cell lines from breast and cns tissues, we want to adjust our p value by performing a Benjamini Hochberg correction.
#performing a paired t-test of treated and untreated samples
t_test_cns <- col_t_paired(cnsTreated, cnsUntreated, alternative = "two.sided", mu = 0,conf.level = 0.95)
t_test_breast <- col_t_paired(breastTreated, breastUntreated, alternative = "two.sided", mu = 0,conf.level = 0.95)
#obtaining Benjamini-Hochberg adjusted p-values
pval_cns <- t_test_cns$pvalue
pval_breast <- t_test_breast$pvalue
fdr_cns <- p.adjust(pval_cns, "BH")
fdr_breast <- p.adjust(pval_breast, "BH")
#obtaining mean FC values over all samples
breastFCm <- as.numeric(colMeans(breastFC))
cnsFCm <- as.numeric(colMeans(cnsFC))
genes <- colnames(breastFC)
For visualization of the biological significance (fold change values) and the statistical significance (adjusted p-values), we plotted our data in an interactive volcano plot.
## breast volcano plot
#creating a matrix containg all needed values for plotting
diff_df_breast <- data.frame(gene = genes, Fold = breastFCm, FDR = fdr_breast)
diff_df_breast$absFold <- abs(diff_df_breast$Fold)
# add a grouping column; default value is "not significant"
diff_df_breast$group <- "NotSignificant"
# change the grouping for the entries with significance but not a large enough Fold change
diff_df_breast[which(diff_df_breast['FDR'] < 0.5 & (diff_df_breast['absFold']) < 1 ),"group"] <- "Significant"
# change the grouping for the entries a large enough Fold change but not a low enough p value
diff_df_breast[which(diff_df_breast['FDR'] > 0.5 & (diff_df_breast['absFold']) > 1 ),"group"] <- "FoldChange"
# change the grouping for the entries with both significance and large enough fold change
diff_df_breast[which(diff_df_breast['FDR'] < 0.5 & (diff_df_breast['absFold']) > 1 ),"group"] <- "Significant&FoldChange"
# Find and label the top peaks.
top_peaks_breast <- diff_df_breast[with(diff_df_breast, order(Fold, FDR)),][1:10,]
top_peaks_breast <- rbind(top_peaks_breast, diff_df_breast[with(diff_df_breast, order(-Fold, FDR)),][1:10,])
# Add gene labels for all of the top genes we found
# creating an empty list, and filling it with entries for each row in the dataframe
# each list entry is another list with named items that will be used
a <- list()
for (i in seq_len(nrow(top_peaks_breast))) {
m <- top_peaks_breast[i, ]
a[[i]] <- list(
x = m[["Fold"]],
y = -log10(m[["FDR"]]),
text = m[["gene"]],
xref = "x",
yref = "y",
showarrow = TRUE,
arrowhead = 0.5,
ax = 20,
ay = -40
)
}
plot_breast <- plot_ly(data = diff_df_breast, x = diff_df_breast$Fold, y = -log10(diff_df_breast$FDR), type = "scatter", text = diff_df_breast$gene, mode = "markers", color = diff_df_breast$group) %>%
layout(title ="Volcano Plot of Lapatinib breast cancer samples",
xaxis = list(title="log2 Fold Change"),
yaxis = list(title="FDR")) %>%
layout(annotations = a)
plot_breast
## CNS volcano plot
diff_df_cns <- data.frame(gene = genes, Fold = cnsFCm, FDR = fdr_cns)
diff_df_cns$absFold <- abs(diff_df_cns$Fold)
# add a grouping column; default value is "not significant"
diff_df_cns$group <- "NotSignificant"
# change the grouping for the entries with significance but not a large enough Fold change
diff_df_cns[which(diff_df_cns['FDR'] < 0.5 & (diff_df_cns['absFold']) < 0.5 ),"group"] <- "Significant"
# change the grouping for the entries a large enough Fold change but not a low enough p value
diff_df_cns[which(diff_df_cns['FDR'] > 0.5 & (diff_df_cns['absFold']) > 0.5 ),"group"] <- "FoldChange"
# change the grouping for the entries with both significance and large enough fold change
diff_df_cns[which(diff_df_cns['FDR'] < 0.5 & (diff_df_cns['absFold']) > 0.5 ),"group"] <- "Significant&FoldChange"
# Find and label the top peaks
top_peaks_cns <- diff_df_cns[with(diff_df_cns, order(Fold, FDR)),][1:10,]
top_peaks_cns <- rbind(top_peaks_cns, diff_df_cns[with(diff_df_cns, order(-Fold, FDR)),][1:10,])
a <- list()
for (i in seq_len(nrow(top_peaks_cns))) {
m <- top_peaks_cns[i, ]
a[[i]] <- list(
x = m[["Fold"]],
y = -log10(m[["FDR"]]),
text = m[["gene"]],
xref = "x",
yref = "y",
showarrow = TRUE,
arrowhead = 0.5,
ax = 20,
ay = -40
)
}
plot_cns <- plot_ly(data = diff_df_cns, x = diff_df_cns$Fold, y = -log10(diff_df_cns$FDR),type = "scatter", text = diff_df_cns$gene, mode = "markers", color = diff_df_cns$group) %>%
layout(title ="Volcano Plot of Lapatinib CNS cancer samples",
xaxis = list(title="log2 Fold Change"),
yaxis = list(title="FDR"))%>%
layout(annotations = a)
plot_cns
To visualize the expression patterns of the top peak genes present in both tissue types we calculated a heatmap, comparing the expression profiles of the two tissue types.
# selecet top peak genes common in cns and breast tissue
tpb_comparison <- subset(top_peaks_breast, gene %in% top_peaks_cns$gene)
tpc_comparison <- subset(top_peaks_cns, gene %in% top_peaks_breast$gene)
# order common genes alphabetically
tpb_comparison <- tpb_comparison[order(tpb_comparison$gene),]
tpc_comparison <- tpc_comparison[order(tpc_comparison$gene),]
## creating heat map of FCs to compare values
cor_mat <- cbind("breast" = tpb_comparison$Fold, "cns" = tpc_comparison$Fold)
rownames(cor_mat) <- tpb_comparison$gene
data <- read.delim
pheatmap(
mat = cor_mat,
color = magma(10),
border_color = NA,
show_colnames = TRUE,
show_rownames = TRUE,
drop_levels = TRUE,
fontsize = 14,
main = "Comparison:
FC levels of cns and breast top peak genes"
)
For better visualization of the expression patterns of the top peak genes among breast and cns tissues we computed a heatmap of the mean fold changes for direct comparison.
#correlation between top peak gene expression patterns in breast and cns tissues treated by lapatinib
cor_mat <- as.data.frame(cor_mat)
dif.FC.BC = data.frame(breast = cor_mat$breast - mean(t(breastFC)), cns = cor_mat$cns - mean(t(cnsFC))) #breast data - mean value, cns data - mean value
dif.FC.BC$gene = rownames(cor_mat)
The heatmap shows, that the top expression genes are regulated similarly in breast and in cns cells after treatment with Lapatinib.
To gather a more detailed conception of the gene expression regulation values (aka the fold change) we also computed a comparing bar plot showing the exact fold change values.
# PLot
b = ggplot(dif.FC.BC,aes(x = gene, y = breast)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = round(breast, 2)), vjust = -0.5, color = "black", size = 3) +
coord_flip() + labs(title = "Mean of Fold Change for breast top peak genes")+
theme(plot.title = element_text(hjust = 0.5,size = 15),
axis.title = element_text(size = 15))
c = ggplot(dif.FC.BC,aes(x = gene, y = cns)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = round(cns, 2)), vjust = -0.5, color = "black", size = 3) +
coord_flip() + labs(title = "Mean of Fold Change for CNS top peak genes")+
theme(plot.title = element_text(hjust = 0.5, size = 15),
axis.title = element_text(size = 15))
ggarrange(b, c,
labels = c("A", "B"),
ncol = 2, nrow = 1)
rm(b,c)
At last, we investigated the expression regulation in both tissue types by calculating the pearson correlation between the fold change values of the top peak genes.
#correlation
cor(cor_mat$breast, cor_mat$cns, method = "pearson")
## [1] 0.921812
The obtained correlation value of ~ 0.92 indicates that Lapatinib treatment yields a similar effect on gene expression in breast and cns tissue cells and therefore might be possible to treat cancer in brain metastases.